This project uses Benford’s law to examine aviation traffic statistics. Every few months, the United States Department of Transportation releases aviation traffic data, which are reported by the airlines themselves. I am interested in finding out whether the data are “true” or not. In particular, I selected four variables that are distinct but correlated with one another: the number of available seats, the number of passengers, the flight distance, and the airtime.
Benford’s law is a phenomenological law, also called the first-digit law. It states that in many listings and tables of statistics, the digit 1 occurs as the leading digit with a probability of about 30%, far more often than the roughly 11% (one in nine) that would be expected if all nine digits were equally likely. Its mathematical form is given below.
\[\mathrm{Prob}(D_1=d)=\log_{10}\left(1+\frac{1}{d}\right) \quad \text{for } d = 1, 2, \ldots, 9\]

Here is a probability distribution table and a bar plot from 1 to 9.
| Leading digit | Probability |
|---|---|
| 1 | 0.3010 |
| 2 | 0.1761 |
| 3 | 0.1249 |
| 4 | 0.0969 |
| 5 | 0.0792 |
| 6 | 0.0669 |
| 7 | 0.0580 |
| 8 | 0.0512 |
| 9 | 0.0458 |
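As a quick check on the formula, here is a minimal Python sketch that recomputes the expected probabilities directly from \(\log_{10}(1+1/d)\). The names `benford_probability` and `benford_table` are my own and are used only for illustration.

```python
import math


def benford_probability(d: int) -> float:
    """Expected probability that the leading digit equals d under Benford's law."""
    if not 1 <= d <= 9:
        raise ValueError("a leading digit must be between 1 and 9")
    return math.log10(1 + 1 / d)


# Expected probability for every possible leading digit.
benford_table = {d: benford_probability(d) for d in range(1, 10)}

for digit, prob in benford_table.items():
    print(f"{digit}: {prob:.4f}")
```

The nine probabilities sum to 1, because the sum of the logarithms is the logarithm of the product \(\frac{2}{1}\cdot\frac{3}{2}\cdots\frac{10}{9}=10\).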
The data for this research come from the Bureau of Transportation Statistics (BTS) and can be found at https://www.bts.gov. The job of the BTS is to collect and compile the data; however, it cannot guarantee that the data provided by the airline companies are accurate. This project therefore tries to identify suspicious data entries.
We will first visualize the data by looking at the 10 largest airlines.
Let’s take a look at the May 2018 data. Among the top 10 airlines, the top 3 all have numbers that begin with 1. Is this a coincidence or not?
This animation shows all the airlines, with the y-axis representing total airtime and the x-axis representing total distance travelled.
Now we will examine all the airlines using Benford’s law and look at the distribution of leading digits. This plot compares the data as a whole against Benford’s law. In the top-left panel, we can see a spike at the digit 1. This is probably because our data cover only the last three years. For example, Southwest Airlines Co.’s monthly passenger total has fluctuated around one million over the last three years.
Overall, the summed available-seats data follow the Benford’s law distribution.
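To make the comparison concrete, here is a minimal Python sketch of how observed leading-digit shares can be set against the Benford expectation. It is not the code behind the plots. The helper names are hypothetical, the mean absolute deviation is just one simple summary I chose, and the sample is the first 100 powers of 2 (a sequence known to track Benford’s law closely) standing in for the airline data.

```python
import math
from collections import Counter


def leading_digit(value: int) -> int:
    """First significant digit of a positive integer."""
    return int(str(value)[0])


def observed_proportions(values):
    """Share of values whose leading digit is 1, 2, ..., 9."""
    counts = Counter(leading_digit(v) for v in values)
    total = sum(counts.values())
    return {d: counts.get(d, 0) / total for d in range(1, 10)}


def mean_absolute_deviation(observed):
    """Average absolute gap between observed shares and Benford's law."""
    expected = {d: math.log10(1 + 1 / d) for d in range(1, 10)}
    return sum(abs(observed[d] - expected[d]) for d in range(1, 10)) / 9


# Stand-in sample (an assumption for illustration): 2**0 through 2**99.
sample = [2 ** n for n in range(100)]
observed = observed_proportions(sample)
mad = mean_absolute_deviation(observed)

for d in range(1, 10):
    print(f"{d}: observed {observed[d]:.3f}, expected {math.log10(1 + 1 / d):.3f}")
print(f"mean absolute deviation: {mad:.4f}")
```

A small deviation suggests the leading digits are close to Benford’s law, while a large gap at a single digit corresponds to a spike like the one described above.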
| UNIQUE_CARRIER_NAME | YEAR | MONTH | sum_seats | sum_passengers | sum_distance | sum_airtime | Date |
|---|---|---|---|---|---|---|---|
| 40-Mile Air | 2015 | 7 | 381 | 105 | 2712 | 8163 | 2015-07-01 |
| ACM AIR CHARTER GmbH | 2015 | 11 | 58 | 10 | 6500 | 0 | 2015-11-01 |
| Aerolitoral | 2015 | 4 | 297 | 207 | 415 | 0 | 2015-04-01 |
| Aeromexico | 2015 | 11 | 160 | 109 | 258 | 0 | 2015-11-01 |
| Air Alsie A/S | 2015 | 4 | 10 | 2 | 458 | 0 | 2015-04-01 |
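One practical detail is visible in these rows: several `sum_airtime` entries are 0, and a zero has no leading digit. The Python sketch below uses only the five rows shown above and assumes that zero entries are simply skipped before the digits are counted. That is my assumption for illustration, not necessarily how the analysis handled them, and `leading_digit` and `digits_for` are hypothetical helper names.

```python
def leading_digit(value: int):
    """First significant digit of an integer, or None when the value is zero."""
    if value == 0:
        return None  # zero has no leading digit
    return int(str(abs(value))[0])


# The five rows from the table above (only the columns used here).
rows = [
    {"carrier": "40-Mile Air", "sum_seats": 381, "sum_airtime": 8163},
    {"carrier": "ACM AIR CHARTER GmbH", "sum_seats": 58, "sum_airtime": 0},
    {"carrier": "Aerolitoral", "sum_seats": 297, "sum_airtime": 0},
    {"carrier": "Aeromexico", "sum_seats": 160, "sum_airtime": 0},
    {"carrier": "Air Alsie A/S", "sum_seats": 10, "sum_airtime": 0},
]


def digits_for(column: str):
    """Return the usable leading digits of a column and the count of skipped zeros."""
    digits = [leading_digit(row[column]) for row in rows]
    usable = [d for d in digits if d is not None]
    return usable, len(digits) - len(usable)


seat_digits, seats_skipped = digits_for("sum_seats")
airtime_digits, airtime_skipped = digits_for("sum_airtime")

print("sum_seats digits:", seat_digits, "skipped:", seats_skipped)
print("sum_airtime digits:", airtime_digits, "skipped:", airtime_skipped)
```

Reporting the number of skipped rows alongside the digits makes it clear how much of a column actually enters the comparison.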
Let’s check some other variables such as distance, passengers and airtime.